{janitor}

Caroline Kostrzewa, Sabrina Lin

Topic Tuesdays: 2024-10-01

FYI

  • Using three Star Wars data sets from Kaggle, each of which we have “messed up” in various ways
    • Species
    • Characters
    • Films


  • chisq.test and fisher.test functions (talk about masking of the stats versions)
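
A minimal sketch of the masking point, assuming a recent janitor release that ships tabyl methods for these tests. The `toy` data frame and its columns are hypothetical stand-ins, not the Kaggle data.

```r
library(janitor)  # attaching janitor masks stats::chisq.test and stats::fisher.test

# Hypothetical toy data standing in for the characters data set
toy <- data.frame(
  gender = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
  alive  = c("Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No")
)

# After attaching, the bare name resolves to janitor's generic
masked_by <- environmentName(environment(chisq.test))

# A two-way tabyl can be passed straight to the masked function
tab <- tabyl(toy, gender, alive)
# The tiny toy counts trigger an approximation warning, silenced here
res_janitor <- suppressWarnings(chisq.test(tab))

# The stats version stays reachable through an explicit namespace
res_stats <- suppressWarnings(stats::chisq.test(table(toy$gender, toy$alive)))
```

Both calls should agree on the p-value, since the tabyl method hands the counts on to the stats function.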

Species

glimpse(species_messy)
Rows: 40
Columns: 11
$ id               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ name             <chr> "Human", "Yoda's species", "Wookiee", "Gungan", "Twi'…
$ classification   <chr> "Mammal", "Unknown", "Mammal", "Amphibian", "Mammal",…
$ designation      <chr> "Sentient", "Sentient", "Sentient", "Sentient", "Sent…
$ average_height   <dbl> 1.80, 0.66, 2.28, 1.96, 1.80, 1.70, 1.70, NA, 2.00, 1…
$ skin_colors      <chr> "Light, Dark", "Green", "Brown", "Orange", "Blue, Gre…
$ hair_colors      <chr> "Various", "White", "Brown", "None", "None", "None", …
$ eye_colors       <chr> "Various", "Brown", "Blue", "Orange", "Various", "Red…
$ average_lifespan <dbl> 79, 900, 400, 70, 80, 70, 78, NA, 70, 70, 80, 94, 100…
$ language         <chr> "Galactic Basic", "Galactic Basic", "Shyriiwook", "Gu…
$ homeworld        <chr> "Various", "Unknown", "Kashyyyk", "Naboo", "Ryloth", …

Characters

glimpse(characters_messy)
Rows: 96
Columns: 13
$ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ name        <chr> "Luke Skywalker", "Leia Organa", "Darth Vader", "Yoda", "H…
$ species     <chr> "Human", "Human", "Human", "Yoda's species", "Human", "Woo…
$ gender      <chr> "Male", "Female", "Male", "Male", "Male", "Male", "Male", …
$ height      <dbl> 1.72, 1.50, 2.02, 0.66, 1.80, 2.28, 1.82, 1.73, 1.88, 1.65…
$ weight      <dbl> 77, 49, 136, 17, 80, 112, 81, 75, 84, 45, 89, 84, 66, 80, …
$ hair_color  <chr> "Blond", "Brown", "None", "White", "Brown", "Brown", "Whit…
$ eye_color   <chr> "Blue", "Brown", "Yellow", "Brown", "Hazel", "Blue", "Blue…
$ skin_color  <chr> "Light", "Light", "Pale", "Green", "Light", "Brown", "Ligh…
$ year_born   <dbl> 19, 19, 41, 896, 29, 200, 57, 82, 41, 46, 92, 72, 52, 102,…
$ homeworld   <chr> "Tatooine", "Alderaan", "Tatooine", "Unknown", "Corellia",…
$ year_died   <dbl> 34, 35, 4, 4, 34, NA, 0, 35, 4, 19, 32, 19, NA, 19, NA, NA…
$ description <chr> "The main protagonist of the original trilogy.", "A leader…

Films

glimpse(films_messy)
Rows: 11
Columns: 6
$ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
$ title         <chr> "Episode I: The Phantom Menace", "Episode II: Attack of …
$ release_date  <date> 1999-05-19, 2002-05-16, 2005-05-19, 1977-05-25, 1980-05-…
$ director      <chr> "George Lucas", "George Lucas", "George Lucas", "George…
$ producer      <chr> "Rick McCallum", "Rick McCallum", "Rick McCallum", "Gary…
$ opening_crawl <chr> "Turmoil has engulfed the Galactic Republic...", "There …

Cleaning

  • clean_names (cleaning) [mess up names from Kaggle species - SpongeBob meme]
  • make_clean_names (cleaning) [Kaggle species data set - change average to mus]
    • mu_to_u
  • get_dupes (cleaning) [Kaggle characters - Saw Gerrera is duplicated]
  • convert_to_date (cleaning) [Kaggle films - make sure dates are Excel dates]
    • mention Excel conversions specifically
  • remove_empty (cleaning) [make empty column to remove]
  • remove_constant (cleaning) [make constant column to remove]
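
The cleaning functions listed above can be sketched roughly as follows. All data here is hypothetical toy data, not the Kaggle files, and the explicit `replace` mapping is an illustrative stand-in for what `mu_to_u` is meant to provide.

```r
library(janitor)

# Hypothetical messy species-like data; names and values are illustrative only
species_toy <- data.frame(
  "Name"           = c("Human", "Wookiee", "Gungan"),
  "Average Height" = c(1.80, 2.28, 1.96),
  "Empty Col"      = NA,
  "Constant Col"   = "Star Wars",
  check.names = FALSE
)

# clean_names() standardises data frame column names (snake_case by default)
species_clean <- clean_names(species_toy)
names(species_clean)

# make_clean_names() does the same for a plain character vector.
# The micro sign is mapped to "u" explicitly here; janitor's mu_to_u
# constant collects the look-alike mu/micro code points for this purpose.
make_clean_names("\u00b5 Height", replace = c("\u00b5" = "u"))

# remove_empty() drops all-NA columns; remove_constant() drops single-valued ones
species_trim <- remove_constant(remove_empty(species_clean, which = "cols"))
names(species_trim)

# get_dupes() returns the duplicated rows plus a dupe_count column
characters_toy <- data.frame(
  name    = c("Luke Skywalker", "Saw Gerrera", "Saw Gerrera"),
  species = c("Human", "Human", "Human")
)
dupes <- get_dupes(characters_toy, name)

# convert_to_date() accepts Excel serial numbers and date strings together
dates <- convert_to_date(c("36299", "1999-05-19"))
```

Excel stores dates as day counts, so a date column read from a spreadsheet can arrive as numbers like 36299. `excel_numeric_to_date()` handles the purely numeric case, while `convert_to_date()` is the more forgiving wrapper for mixed input.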

Tidying

  • round_half_up [Kaggle characters - height]
  • round_to_fraction [Kaggle characters - height]
  • signif_half_up [Kaggle characters - height]
  • compare_df_cols (checking) [split Kaggle characters into humans vs non-human and mess one up, then use these to check before row binding]
  • compare_df_cols_same (checking) [split Kaggle characters into humans vs non-human and mess one up, then use these to check before row binding]
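
A short sketch of the rounding and checking helpers, using made-up numbers and two hypothetical one-row data frames in place of the real humans vs non-human split.

```r
library(janitor)

# Base R rounds halves to the nearest even number; round_half_up() does not
halves     <- c(0.5, 1.5, 2.5)
base_round <- round(halves)
half_up    <- round_half_up(halves)

# Round a height to the nearest quarter
quarter <- round_to_fraction(1.72, denominator = 4)

# Significant digits, with halves rounded up
sig <- signif_half_up(12.5, digits = 2)

# Hypothetical split where one piece was "messed up": height stored as text
humans     <- data.frame(name = "Luke Skywalker", height = 1.72)
non_humans <- data.frame(name = "Yoda", height = "0.66")

# Side-by-side column classes for each data frame
col_report <- compare_df_cols(humans, non_humans)

# Single TRUE/FALSE answer on whether the pieces are safe to row bind
can_bind <- compare_df_cols_same(humans, non_humans)
```

Running `compare_df_cols_same()` before `dplyr::bind_rows()` turns a class mismatch into an early, readable check instead of a bind error later on.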

tabyl

  • tabyl [Kaggle characters - death status vs weight]
  • adorn_*
  • untabyl
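
One possible shape for the tabyl demo. The `status` and `weight_class` columns are hypothetical derived columns (for example, from `year_died` and a binned `weight`), invented here for illustration.

```r
library(janitor)  # also re-exports the %>% pipe

# Hypothetical toy data; both columns are assumed to be derived beforehand
characters_toy <- data.frame(
  weight_class = c("Light", "Heavy", "Heavy", "Light", "Heavy", "Light"),
  status       = c("Died", "Died", "Alive", "Alive", "Died", "Alive")
)

# Two-way frequency table: weight class in rows, death status in columns
tab <- tabyl(characters_toy, weight_class, status)

# adorn_* functions layer totals, percentages, and counts onto the tabyl
report <- tab %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

# untabyl() strips the tabyl class and attributes, leaving a plain data frame
plain <- untabyl(tab)
```

The order of the adorn_* calls matters in this sketch: totals are added before percentages are computed, and the counts are appended last so they display next to the formatted percentages.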